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Next-generation sequencing (NGS) teclinoiogies liave dramatically expanded the breadth 
of genomics. Genome-scale data, once restricted to a small number of biomedical model 
organisms, can now be generated for virtually any species at remarkable speed and 
low cost. Yet non-model organisms often lack a suitable reference to map sequence 
reads against, making alignment-based quality control (QC) of NGS data more challenging 
than cases where a well-assembled genome is already available. Here we show that by 
generating a rapid, non-optimized draft assembly of raw reads, it is possible to obtain 
reliable and informative QC metrics, thus removing the need for a high quality reference. 
We use benchmark datasets generated from control samples across a range of genome 
sizes to illustrate that QC inferences made using draft assemblies are broadly equivalent 
to those made using a well-established reference, and describe QC tools routinely used in 
our production facility to assess the quality of NGS data from non-model organisms. 
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INTRODUCTION 

Until 5 years ago, genomic research was largely confined to a rel- 
atively small number of taxonomic groups in which sequencing 
efforts were focused on a handful of model organisms. Next- 
generation sequencing (NGS) technologies have expanded the 
scope of genomics research by increasing throughput many fold 
compared to traditional Sanger sequencing, at a much lower cost 
per base (Pareek et al, 2011) With genome-scale studies now 
possible in virtually any species within the budget of a standard 
grant, NGS data are being generated in non-model organisms at 
an unprecedented pace. However, NGS can be affected by a range 
of artifacts that arise during the library preparation and sequenc- 
ing processes, which can negatively impact the quality of the raw 
data for downstream analyses. These issues include platform spe- 
cific error profiles, systematic variation in quality scores across the 
sequence read, biases in sequence generation driven by base com- 
position, departure from optimal library fragment sizes, variation 
in the proportions of duplicate sequences introduced by PGR 
amplification bias, and contamination from known and unknown 
species other than the sequencing target (Schmieder and Edwards, 
2011a; Zhou etal, 2013). 

Several software tools have been published that can highlight 
quality issues in NGS data, including low base quality, contamina- 
tion with adapter sequences and biases in base composition (e.g., 
Andrews, 2010; Lohse et al, 2012; Patel and Jain, 2012). Initial 
steps in the quality control (QG) process typically involve assess- 
ing the intrinsic quality of the raw reads using metrics generated 
by the sequencing platform (e.g., quality scores) or calculated 
directly from the raw reads (e.g., base composition). One of the 
most popular tools for the generation of these quality metrics 
is FastQG (http://www.bioinformatics.babraham.ac.uk/projects/ 



fastqc/). FastQG and other similar tools are useful for assessing 
the overall quality of a sequencing run and are widely used in 
NGS data production environments as an initial QG checkpoint. 
Further QG steps commonly performed involve mapping the 
raw reads to a known reference to calculate a range of metrics 
from alignment profiles. These include the mapping rate to the 
expected target, levels of fragment or sequence duplication, and 
estimates of the library insert sizes. These metrics are routinely 
calculated for NGS data derived from model organisms where 
a well-established reference is available and generally included 
in QG reports. However this alignment-based approach is not 
directly possible when sequencing a novel genome. Tools exist 
that can calculate QG metrics such as sequencing errors and over- 
represented sequences in k-mer space without a reference genome 
(Schroder et al, 2010; Keegan et al, 2012; Wang et al, 2012). 
However, these do not generally predict library insert size and 
duplication rate. The preqc component of SGA (Simpson and 
Durbin, 2011; Simpson, 2014) can predict genome characteris- 
tics and QG metrics including fragment length and duplication 
levels but as these metrics are calculated only on a subset of the 
data in k-mer space, duplicate rate for a large dataset can be mas- 
sively underestimated. Also, estimating insert size for mate pair 
libraries is not practical with this approach. Other tools includ- 
ing PRINSEQ (Schmieder and Edwards, 2011b), FASTX-Toolkit^ 
(http://hannonlab.cshl.edu/fastxtoolkit/), and GD-HIT (Fu et al., 
2012) can predict the rate of fragment or read duplication without 
a reference, but have significant limitations. As these techniques 
are based on sequence-clustering algorithms, identical sequences, 



FASTX-Toolkit. [Online]. Available online at: http://hannonlab.cshl.edu/ 
fastxtoolkit/ 
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which might or might not be duplicates, can be erroneously 
removed. In addition, these approaches are both time consum- 
ing and computer memory intensive, and can create bottlenecks 
in a high throughput production environment where rapid and 
efficient QC of raw NGS data is necessary. 

Detecting contaminants in the absence of a reference is equally 
challenging. Published methods exist for the detection of read 
contaminants, e.g., DeconSeq (Schmieder and Edwards, 2011a) 
and FastQ Screen (Andrews, 2011). These tools are based on 
identification of contamination from known sources by opti- 
mized alignment methods. However they fail when the sequence 
of the contaminant is not present in the screening database. 
Similarly, BLAST-based methods are computationally expensive 
when applied to large raw read datasets and cannot be imple- 
mented in a production environment. 

In this study, we show how it is possible to generate a 
draft assembly from the raw data, rapidly and without opti- 
mization, and then use this for the generation of reliable QC 
metrics. To illustrate the utility of this approach, we gener- 
ated benchmark sequence datasets from control samples of 
three model species {Escherichia coli, Arabidopsis thaliana and 
Homo sapiens), for which a high quality reference sequence 
is available, and applied our QC tools to the raw reads. 
By employing both standard mapping-based tools to esti- 
mate PCR duplicate rates and library insert sizes, and new 
approaches such as the taxon-annotated GC-coverage (TAGC) 
plot pipeline (Kumar et al., 2013) to identify contaminants, we 
show broad equivalence of the de novo and reference-based QC 
approaches. 

MATERIALS AND METHODS 

LIBRARY PREPARATION AND SEQUENCING OF CONTROL SAMPLES 

DNA and RNA samples used to generate control libraries were 
obtained from commercial sources (£. coli K12 DNA: Invivogen, 
catalog no. tlrl-ednaef; H. sapiens DNA: CorieU Institute for 
Medical Research, catalog no. NA10857; A. thaliana DNA: 
AMS Biotechnology, catalog no. D1634310; H. sapiens RNA: 
Ambion, catalog no. AM7962). All samples were quantified 
by fluorescence-based measurements (Invitrogen Qubit) and 
assessed for quality using Life Technologies E-gels (DNA) or 
Agilent Technologies Bioanalyzer (RNA) before library prepara- 
tion. 

Genomic libraries with insert sizes of 180, 300, and 600 bp 
were prepared for all three species using lUumina TruSeq DNA 
Sample Prep Kit following the manufacturer's instructions with 
some modifications. Briefly, 3 \ig of genomic DNA was sheared 
using a Covaris S2 instrument (180 bp: duty cycle 10%, inten- 
sity 5, cycles/burst 200, time 420 s; 300 bp: duty cycle 10%, 
intensity 4, cycles/burst 200, time 110 s; 600 bp: duty cycle 5%, 
intensity 3, cycles/burst 200, time 80 s) in 120 |il reactions with IX 
TE buffer, cleaned up with 1:1 ratio Ampure XP beads (Beckman 
Coulter Inc.), and ligated to unbarcoded lUumina paired-end 
adapters. Post-ligation, each library was individually size selected 
to the target size with a Sage Science BluePippin DNA size selec- 
tion system using the 1.5% agarose gel cassette protocol and 
tight cuts at 320 bp (180 bp insert), 440 bp (300 bp insert) and 
740 bp (600 bp insert). Size selected libraries were eluted in 40 \l\ 



volumes and enriched by PCR using library-specific indexed 
primers complementary to the lUumina paired-end adapters. 

The E. coli mate-pair library was constructed using a com- 
bination of Life Technologies SQLiD Long Mate-Paired Library 
Construction Kit and lUumina Mate Pair Library Prep Kit v2 
following the manufacturers' recommendations. 

H. sapiens transcriptome (RNAseq) libraries were prepared 
using lUumina TruSeq RNA Sample Prep Kit v2 foUowing the 
manufacturer's instructions, using 1 |ig total RNA input and 12 
PCR cycles in the enrichment step. 

The mock-contaminated library was created by spiking the 
300 bp insert E. coli library (Eco300) into the 300 bp insert H. 
sapiens library (Hsa300) in proportions 1:20. 

All libraries were checked on a Bioanalyzer High Sensitivity 
DNA Chip (AgUent Technologies) and quantified by qPCR (Kapa 
Library Quantification Kit) before lUumina sequencing on GAIIx, 
HiSeq 2500 or MiSeq platforms as per the manufacturer's instruc- 
tions. Summary of aU libraries is given in Table 1 . Raw sequence 
data were submitted to the Short Read Archive with acces- 
sion number ERP004578 (http://www.ebi.ac.uk/ena/data/view/ 
ERP004578). 

BIOINFORMATICS ANALYSES 
Pre-processing of reads 

All reads were trimmed for adapter sequences and poor qual- 
ity bases (<Q30) using fastq-mcf (http://code.google.eom/p/ 
ea-utils) with the following parameters q = 30, Z = 35 and qual- 
mean = 30. 

Mapping to reference 

Reads were aligned to draft assemblies and reference 
genomes/transcriptomes using BWA 0.6.1 (Li and Durbin, 
2009a,b) with default parameters. Insert size and PCR dupli- 
cation rate metrics were obtained using PICARD^ (v. 1.99) 
and alignment rate was calculated using SAMtools (v.0.1.18). 
The genomes of E. coli K12 MG1655 (http://www.ncbi.nlm. 
nih.gov/nuccore/NC_000913.3),A thaliana TAIRIQ (ftp://ftp.ara 
bidopsis.org/home/tair/Sequences/whole_chromosomes/), and 
H. sapiens hgl9 (http://hgdownload.soe.ucsc.edu/goldenPath/ 
hgl9/bigZips/) were used as references for mapping of reads 
derived from the relevant genomic libraries. For transcriptome 
data, mRNA sequences from UCSC were used as a second 
reference (along with the genome) to compare QC results using 
genomic versus transcriptomic references. 

Contig assembly 

We generated genome assemblies from genomic data using CLC 
Assembly CelP (v.4.2.0, thereafter referred to as CLC) and 
SOAPdenovo2 (Luo et al., 2012), and transcriptome assemblies 
from mRNA reads using CLC and SOAPdenovo-Trans (Xie et al., 
2013). Paired-end and mate-pair data were treated as single-end 
data by combining both reads in a single file. 

Two parameters were defined for SOAPdenovo2 and 
SOAPdenovo-Trans: k-mer size (K) was set to 31 and the 



^PICARD. [Online]. Available online at; http://picard.sourceforge.net/ 
^CLC-bio assembly-cell. [Online]. Available online at: http://vvww.clcbio. 
com/ 
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Table 1 | Summary of benchmark datasets. 
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minimum contig length cutoff was set to 100. The choice of 
k-mer was not optimized, as our aim was to assemble reads into 
longer contigs and not to generate the best assembly. By default 
SOAPdenovo2 reports contigs with minimum length cutoff of 
K*2, but we observed that very small contigs (62 bases, if = 3 1 ) 
were too short for the QC analyses we wanted to perform. No 
parameter optimization was used for CLC because the program 
estimates optimal parameters based on the data. 

Two quality metrics were calculated to describe draft assem- 
blies: % assembly size (the proportion of the reference covered 
by the draft assembly) and % chaff contig size (the proportion of 
the assembly made up of contigs less than or equal to 300 bases) 
(Salzberg et al, 2012). 

Contamination checli 

The proportion of G and C bases (GC content) and the read 
coverage for each contig in the draft assembly of this mixed 
dataset were calculated using the TAGC plot pipeline (available at 
https://github.com/sujaikumar/assemblage; Kumar and Blaxter, 
2011; Kumar et al., 2013). To identify potential contaminants 
de novo, contigs or a subset of contigs from the assemblies of 
the genomic data were compared to the National Center for 
Biotechnology Information (NCBI) non-redundant nucleotide 
database (nt) using megablast program in BLAST (ncbi-blast- 
2.2.28-I-) (Altschul et al, 1990). The hits obtained were then used 
to generate TAGC plots (Kumar et al., 2013), which were reviewed 
manually. 

RESULTS 

OVERVIEW OF OC ASSEMBLIES 

We generated draft QC assemblies for each library using CLC, 
which we used in-house in our QC pipeline, and another, 
open-source assembler, SOAPdenovo2, for comparison. Detailed 
metrics of all the assemblies are given in Table 2. 



ESCHERICHIA COLI GENOMIC DATA 

CLC assembled the E. coli 180 bp insert (EcolSO), 300 base 
pair insert (Eco300), 600 bp insert (Eco600) and 3kb mate-pair 
(Eco3K) libraries into 151, 183, 168, and 246 contigs, respectively. 
Most contigs in each assembly were over 1 kb. SQAPdenovo2 con- 
sistently produced larger numbers of contigs: 3569 contigs for 
EcolSO, 5411 contigs for Eco300, 12,296 contigs for Eco600, and 
589 contigs for Eco3K. 

ARABIDOPSIS THALIANA GENOMIC DATA 

CLC assembled the A. thaliana reads into 40,865, 37,278, and 
25,436 contigs from the 180, 300, and 600 bp insert libraries 
respectively, with fewer than 2% of bases in chaff contigs. 
SOAPdenovo2 produced 643,869, 976,946, and 714,501 contigs 
with 57.62, 81.60, and 59.85% of bases in chaff contigs (contigs 
<300bp) for the 180 bp (Athl80), 300 bp (Ath300) and 600 bp 
(Ath600) libraries respectively. 

HOMO SAPIENS GENOMIC DATA 

Both CLC and SOAPdenovo2 produced highly fragmented 
assemblies from the H. sapiens reads containing millions of con- 
tigs for each library. The chaff contig size proportion was higher 
for the SOAPdenvo2 assemblies (11-20%) than for the CLC 
assemblies (3-5%). The proportion of the genome assembled 
for all libraries was ~75%. Obviously, much greater coverage 
is required to generate full assembly representation of the 3 Gb 
human genome. 

HOMO SAPIENS TRANSCRIPTOMIC DATA 

The H. sapiens RNAseq libraries were assembled using CLC 
and SQAPdenovo-TRANS. CLC generated fewer contigs (or 
transcript fragments; 114,337, 98,113, and 106,392 contigs for 
HsaRNAI, HsaRNA2, and HsaRNA3 respectively) than did 
SOAPdenovo-TRANS (346,228, 228,311, and 339,888 contigs for 
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Table 2 | Assembly metrics. 

Library Assembler Max contig #Contigs Total N50 GC contigs #Contigs Total bases GC contigs Chaff Assembly 

length (bp) bases (bp) (%) >1kb in contigs >1kb(%) size (%) size (%) 

{Mb) >1 kb (Mb) 
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Values for Assembly size and Chaff size are expressed as a percentage of the true genome size. 



HsaRNAI, HsaRNA2, and HsaRNA3 respectively). The propor- 
tion of chaff contigs was relatively low for both assemblers: <2% 
for CLC assemblies and <8% for SOAPdenovo2 assemblies. The 
assembly size for both tools was ~20% of the UCSC mRNA 
reference, indicating significant incompleteness relative to the 
whole human transcriptome, but likely reflecting restricted gene 
expression in the tissue surveyed. 

DUPLICATE RATE 
Genomic libraries 

The mapping rate for the three E. coli libraries was 98% 
when mapped to the standard reference. Mapping to the draft 
CLC assembly produced a similar mapping rate. When the 
SOAPdenovo2 assembly was used as a reference the mapping 
rate was slightly reduced to 95% for the EcolSO and Eco300 
libraries, and to 90% for the Eco600 library (Table 3, Figure 1). 
Fewer PCR duplicates were identified against the reference and 
the CLC assemblies than against the SOAPdenovo2 assemblies. 
The mapping rate for the E.coli 3 kb mate pair library was ~98% 
to the standard reference genome, the CLC assemblies and the 



SOAPdenovo2 assemblies, with consistent duplicate rates across 
methods (Table 3, Figure 1). 

The mapping rate for the A. thaliana libraries was ~96% 
when mapped to the standard reference genome. This was ~90% 
when mapped to the CLC assemblies but dramatically lower 
(~59%) when mapped to the SOAPdenovo2 assemblies. In addi- 
tion, the duplicate rate was predicted to be ~12% for all three 
libraries using the standard reference genome and CLC assem- 
blies, but only 2% when using the SOAPdenovo2 assemblies 
(Table 3, Figure 1). To investigate this discrepancy, we examined 
the AthlSO library data further. All reads which were marked 
as duplicates after mapping to the standard reference genome 
were extracted and mapped to the SOAPdenovo2 assembly: 95% 
of these reads remained unmapped against the SOAPdenovo2 
assembly. We observed that the SOAPdenovo2 assemblies con- 
tains a large proportion of bases in chaff contigs indicating that 
there are many regions of the genome failing to assemble, thus 
fragmenting the assembly. This fragmentation is likely to cause 
an "edge-effect" when reads are aligned with BWA. Internally, 
BWA concatenates all reference sequences (contigs in our case) 
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Table 3 | Mapping statistics for genomic libraries. 



Sample ID #Reads mapped %Reads mapped #Duplicate reads %Duplicate reads Mean insert length SD insert length 
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19.01 


AthSOO 


66,917669 


91.90 


o,ZDy,4z/ 


12.36 


289. OD 


33.49 


Atheoo 


52,667137 


89.53 


D,Uo/,o4J 


11.56 


594.36 


71.53 


HsalSO 


254,730,241 


80.00 


JU,yDD,OOD 


12.16 


159.64 


20.13 


HsaSOO 


257834,392 


80.77 


4U,U4y, Jbo 


15.53 


268.47 


30.5 


HsaeOO 


184,136,326 


72.17 


50,589,333 


2747 


453.27 


135.32 


ALIGNMENT TO SOAPdenovo2 ASSEMBLY 










EcolSO 


2,719,197 


96.31 


26,608 


0.98 


175.69 


22.02 


EcoSOO 


3,185,522 


95.78 


35,419 


1.11 


289.55 


22.76 


EcoOOO 


3,058,613 


90.78 


58,099 


1.90 


565.64 


76.81 


Ecoli-SK 


26,696,300 


98.05 


10,338,237 


3796 


2715.88 


305.67 


AthlSO 


35,381,329 


59.19 


492,038 


1.39 


163.44 


19.01 


AthSOO 


43,453,271 


59.68 


987491 


2.27 


291.18 


25.74 


Atheoo 


34,367686 


58.43 


1,233,130 


3.59 


599.55 


4713 


HsalSO 


212,773,751 


66.83 


20,335,478 


9.56 


158.82 


19.74 


HsaSOO 


214,568,006 


6721 


27419,943 


12.78 


268.52 


29.82 


HsaBOO 


158,562,678 


62.15 


39,751,153 


25.07 


466.71 


124.63 
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FIGURE 1 I Estimation of duplicate rate for paired-end genomic libraries. Duplicate rates are plotted for each species and each target size using reads 
mapped against the species reference, CLC, and SOAPdenovo2 assemblies. 
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into one long, contiguous sequence and a read can be mapped 
to the junction of two adjacent reference sequences. In tliis case 
BWA will flag the read as unmapped (http://bio-bwa.sourceforge. 
net/). This leads to an apparent reduction in both the mapping 
rate and the duplicate rate through the exclusion of reads aligned 
to the edge of contigs in the calculations of the PGR duplicate rate. 
To test this, we altered the k-mer used by SOAPdenovo2 in order 
to assemble the reads into longer contigs. We used KmerGenie 
(Chikhi and Medvedev, 2013) to select optimized parameters for 
the assembly. This suggested using a k-mer size of 45 and coverage 
cutoff of 2. We ran SOAPdenovo2 again using these parameters, 
which produced an improved assembly with 146,503 contigs and 
an N50 of 5748 bases. Reads for AthlSO were mapped to this 
assembly, which yielded a 25% increase in the mapping rate and 
a 3 % increase in the duplicate rate. When we mapped reads 
which were flagged as duplicates, against the standard reference 
genome to the improved assembly, we also observed an increase 
in the mapping from 5% to 20% (i.e., 80% of these remained 
unmapped). 

For the H. sapiens data, the mapping rate was 95% against the 
standard reference genome. This reduced to 80% against the CLC 
assemblies for HsalSO and Hsa300. The mapping rate was 70% 
for Hsa600 against the CLC assembly. The duplicate rate for these 
data was ~25% when reads were aligned against the standard 
reference and the SOAPdenovo2 assemblies, and slightly higher 
(27%) against the CLC assembly (Table 3). 

Transcriptome libraries 

For the transcriptome (RNAseq) libraries, mapping results 
against the reference genome, reference transcriptome and the 
two assemblies were very similar (Table 4). The mapping rate was 
~90% for alignment to the CLC assemblies and reference tran- 
scriptomes, and 94% for alignment to the SOAPdenovo-TRANS 
assemblies. The duplicate rate was consistent across all three 
mapping approaches for each replicate library (Figure 2). Some 
differences were observed between replicates, which can be 
attributed to differences in coverage (Table 1). 



INSERT SIZE DISTRIBUTION 

Insert size distributions estimated for the genomic libraries, 
including the mate-pair library, against the standard refer- 
ence closely matched the target, for all insert sizes and species 
(Figures 3-6). Distributions estimated against the draft assem- 
blies gave very similar results. Mapping to the SOAPdenovo2 draft 
assemblies yielded lower numbers of mapped pairs, but gave sim- 
ilar insert size estimates. Similarly, insert size distributions for the 
RNAseq libraries were consistent across replicates and assemblies, 
and consistent with the reference-based estimates (Figure 7). 

CONTAMINATION CHECK 

Approximately 4% of the reads derived from the mock E. 
coli-H. sapiens library (EH-Mock) mapped to the E. coli 
reference (Table 5). TAGC plots generated for the CLC and the 
SOAPdenovo2 assemblies using all contigs revealed two clusters 
(Figure 8): a large cluster with read coverage between 1 and 500 
and GC between 20 and 80%, and a small, well-defined cluster 
with coverage greater than 100 and GC between 40 and 60%. 
Contigs in the large cluster were annotated with BLAST matches 
from the taxonomic order Primates, and those in the smaller 
cluster were annotated with matches from the taxonomic order 
Enterobacteriales. Overall, ~4 and 3% of the raw reads mapped 
to the small cluster contigs in the CLC and SOAPdenovo2 assem- 
blies, respectively (Table 5). TGAC plots generated from a subset 
of randomly selected contigs (5%) resolved SOAPdenovo2 contigs 
into Enterobacteriales and Primate-annotated clusters but failed 
to identify distinct but clusters among CLC contigs (Figure 9). 
4.5% of reads mapping to the randomly selected SOAPdenovo2 
contigs mapped to contigs annotated as Enterobacteriales, while 
this figure was only 0.04% for CLC contigs (Table 5). 

DISCUSSION 

We have described a rapid assembly and QC protocol that permits 
robust estimation of a number of key QC metrics (duplication 
rates and library insert sizes) in the absence of a high quality 
reference genome. We tested the performance of this protocol by 



Table 4 | Mapping statistics for RNAseq libraries. 


Sample ID #Reads mapped 


%Reads mapped 


#Dupllcate reads 


%Duplicate reads 


Mean insert length 


SD Insert length 






HsaRNAl 102,189,446 


91.28 


50,803,861 


49.72 


215.99 


72.03 


HsaRNA2 72,036,569 


90.43 


32,114,542 


44.58 


184.88 


64.15 


HsaRNA3 91,195,551 


91.22 


45,907384 


50.34 


211.94 


6764 


ALIGNMENT TO TRANSCRIPTOME 












HsaRNAl 101,085,843 


90.29 


48,569,451 


48.05 


221.99 


73.19 


HsaRNA2 72,043,659 


90.44 


30,536,011 


42.39 


191.66 


6762 


HsaRNA3 90,151,071 


90.18 


43,899,305 


48.70 


21769 


68.43 




HsaRNAl 102,553,145 


91.60 


50,224,113 


48.97 


214.16 


6759 


HsaRNA2 72,815,405 


91.41 


31,802,909 


43.68 


184.07 


62 


HsaRNA3 91,194,129 


91.22 


45,376,410 


49.76 


210.12 


63.27 


ALIGNMENT TO SOAPdenovo-TRANS ASSEMBLY 










HsaRNAl 105,695,602 


94.41 


50,837146 


48.10 


213.49 


674 


HsaRNA2 74,505,094 


93.53 


31,718,199 


42.57 


183.55 


61.72 


HsaRNAS 94,064,116 


94.09 


45,826,021 


48.72 


209.46 


63.14 
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FIGURE 2 I Estimation of duplicate rate for transcriptomic libraries. Duplicate rates are plotted for each replicate RNAseq library using reads mapped 
against the human genome and transcriptome references, CLC, and SOAPdenovo2 assemblies. 
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FIGURE 3 I Estimation of insert sizes for £. coli paired-end genomic libraries. Insert sizes are plotted for each target size using reads mapped against the 
E. coli reference genome, CLC, and SOAPdenovo2 assemblies. 
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FIGURE 4 I Estimation of insert sizes for A. thaliana paired-end genomic libraries. Insert sizes are plotted for each target size using reads mapped against 
the A. thaliana reference genome, CLC, and SOAPdenovo2 assemblies. 



comparing QC metrics derived from analyses against the unop- 
timized assemblies to those derived from mapping to reference 
assemblies, using E. coli, A. thaliana, and H. sapiens raw data. 
Because speed is essential within a production environment, we 
currently use CLC to generate preliminary assemblies in our cur- 
rent QC pipeline. To test our strategy using open-source software, 
we also included SOAPdenovo2 as it is widely used and can 
assemble larger genomes using significantly reduced time and 
memory relative to other assemblers (Li et al., 2010). We recog- 
nize that other assembly tools may give different results in this 
context but we note that comparing our results across a range of 
assemblers is outside the scope of this study focused on describing 
methodology established and routinely used in our facility. 

Despite the fact that some of the assemblies were fragmented 
(for example, millions of contigs for human samples), we found 
that QC results such as insert size and detection of contami- 
nants derived from alignment of data to QC assemblies using 
CLC were equivalent to those obtained after alignment to the 
reference genome. In particular, insert size estimates predicted 
from alignment to CLC and reference genome assemblies were 
highly similar for both genomic (including the mate-pair library) 
and transcriptome libraries. These insert size frequency plots 
(Figures 3-6) are very helpful for general data QC, but also for 
directing filtering strategies to remove reads from very short 
inserts (a common finding in lUumina libraries generated using 



PCR), and in estimating parameters for full, optimal assem- 
bly. The duplicate rate estimates obtained against the reference 
and CLC assemblies were essentially identical across all genomic 
libraries. For RNAseq samples, we found that mapping to tran- 
scriptome and genome provided similar results, and that dupli- 
cate rate estimates were dependent on coverage. 

The QC metrics estimated from the SQAPdenovo2 assem- 
blies gave essentially the same results, except for the A. thaliana 
genomic libraries, where mapping rates were significantly lower 
than predicted by both reference-based mapping and the CLC 
approach. SOAPdenovo2 generated a larger number of short con- 
tigs in all assemblies, but especially for the A. thaliana libraries, 
perhaps because of features characteristic of plant genomes such 
as families of highly similar genes and repeats (Claros et al., 
2012). As a result, many reads mapped to contig edges, remained 
unmapped or mapped to a different contig than their mate. 
Optimizing assembly parameters for SOAPdenovo2 improved 
mapping rates and gave estimates closer to those derived from 
mapping to the reference genome. Although the exact reasons 
for low duplicate rates assessed during mapping against the 
SOAPdenovo2 assemblies remain unclear, our data suggest that 
an excess of small contigs can lead to an underestimate of the 
duplicate rate. 

Our contamination check protocol successfully identified the 
presence of exogenous reads in the mock-contaminated human 
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FIGURE 5 I Estimation of insert sizes for H. sapiens paired-end genomic libraries. Insert sizes are plotted for each target size using reads mapped against 
the H. sapiens reference genome, CLC, and SOAPdenovo2 assemblies. 
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FIGURE 6 I Estimation of insert sizes for E. coli mate pair library. Insert sizes are plotted using reads mapped against £ coli reference genome, CLC, and 
SOAPdenovo2 assemblies. 
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Table 5 | Mapping statistics for the moclc-contaminated E. co//-human genomic library. 



Reference #Sequences Total bases %Reference #Reads %Reads 

bases mapped mapped* 





Human hg19 


25 


3,157607875 


100.00 


258,647575 


91.23* 


E.coli K12 MG1655 


1 


4,705,957 


100.00 


11,536,466 


4.07* 


CLC all contigs 


2,316,927 


2,081,896,905 


65.83 


224,568,185 


79.21* 


CLC contigs annotated as enterobacteriales 


159 


4,464,778 


94.88 


11,514,531 


4.06* 


CLC contigs annotated as primates 


953,061 


649,718,210 


20.58 


68,061,165 


24.01* 


SOAPdenovo2 all contigs 


4,548,560 


2,269,464,719 


71.77 


192,053,787 


6774* 


SOAPdenovo2 contigs annotated as enterobacteriales 


4869 


3,579,938 


76.07 


8,421,772 


2.97* 


SOAPdenvo2 contigs annotated as primates 


3,697,548 


1,848,428,560 


58.54 


146,728,858 


51.75* 


ALIGNMENT TO 5% CONTIGS SELECTED AT RANDOM ^^^^B ^^^^^^V^ 


CLC contigs in subset (5%) 


115,846 


104,264,699 




15,840,263 


5.59* 


CLC contigs in subset (5%) annotated as enterobacteriales 


4 


2248 


0.05 


7073 


0.04* 


CLC contigs in subset (5%) annotated as primates 


47440 


32,471,856 


1.03 


5,027091 


31.74** 


SOAPdenovo2 contigs in subset (5%) 


227428 


113,433,961 




9,675,703 


3.41** 


SOAPdenovo2 contigs in subset (5%) annotated as enterobacteriales 


227 


184,821 


3.93 


435,254 


4.5** 


SOAPdenvo2 contigs in a subset (5%) annotated as primates 


184,892 


92,556,545 


2.93 


7382,636 


76.3** 



*Mapped reads is relative to ttie total number of reads generated from the library. 

** Mapped reads is relative to the total number of reads mapped to the subset (5%) of randomly selected contigs. 
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FIGURE 8 I Taxon-annotated GC-coverage (TAGC) plots for the 
mock-contaminated E co//-human library. (A) TAGC-plot generated after 
alignment to the CLC assembly; (B) TAGC-plot generated after alignment to 
the SOAPdenovo2 assembly. Individual contigs are plotted based on GC (X 



axis) and read coverage (Y axis, logarithmic scale). Contigs are colored 
according to the taxonomic order of the best megablast match to the NCBI nt 
database (with £-value < le-50). Contigs without an annotated BLAST match 
are shown in gray. 




FIGURE 9 I Taxon-annotated GC-coverage (TAGC) plots from a subset 
of randomly selected contigs for the mock-contaminated E. 
co//-human library. (A) TAGC-plot generated after alignment to a random 
subset of contigs (5) from CLC assembly; (B) TAGC-plot generated after 
alignment to to a random subset of contigs (5) from the SOAPdenovo2 



assembly. Individual contigs are plotted based on GC (X axis) and read 
coverage (Y axis, logarithmic scale). Contigs are colored according to the 
taxonomic order of the best megablast match to the NCBI nt database 
(with E-value < le-50). Contigs without an annotated BLAST match are 
shown in gray. 



library. The TAGC approach clearly identified two clusters of con- 
tigs showing contaminating Enterobacteriales sequences against a 
primate background in both draft assemblies. The proportion of 
contaminating reads estimated against the E. coli reference was 
in very close agreement with the estimate from the CLC assem- 
bly, while the same approach using SOAPdenovo2 assemblies 



slightly underestimated the proportions of contaminated reads. 
These results suggest that our protocol may also be used to quan- 
tify contamination levels, although accuracy may vary with the 
assembly method and the proportion of contigs used to gener- 
ate TGAC plots. Significant amount of time and compute for this 
screening was taken by BLAST to query all contigs against NCBI 
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nucleotide database (nt). Repeating the analysis using a subset of 
randomly selected contigs reduced this time to several folds and 
also correctly identified the presence of Enterobacteriales reads in 
the data but gave variable estimates of the proportions of contam- 
inating reads, presumably due to stochastic errors due to random 
sampling. Thus, while sub-sampled TGAC plots may be effec- 
tive to detect the presence of exogenous reads, we recommend 
using all contigs to maximize the power to detect and quantify 
contaminants. 

CONCLUSIONS 

QC of raw reads is an essential first step in the analysis of NGS 
data. Mapping-based approaches are accurate and time efficient 
for collecting QC metrics such as duplicate rate and insert size, 
but the lack of reference sequences for non-model species has 
been a major bottleneck. Here, we use the power of rapid de 
novo genome and transcriptome assembly to generate contig 
sets to which the original reads can be mapped. The metrics 
derived from the unoptimized, CLC draft assembly and mapping 
approach are closely similar to those from reference genome map- 
ping, and serve to deliver equivalent QC data. While our approach 
successfully estimated the insert size distribution of a 3 kb mate 
pair library prepared from £ coM, we recognize that mate-pair 
libraries can be challenging to assemble, especially when the vir- 
tual insert size is large and/or the target genome is complex. These 
will typically be generated alongside a range of standard libraries 
with different insert sizes. In practice we recommend to map the 
reads derived from the mate pair library against the draft assembly 
of contigs generated from the standard libraries and calculate an 
estimate of insert length and duplicate rate from this alignment. 

The use of SOAPdenovo2 as an alternative assembler was gen- 
erally successful and gave similar metrics to CLC in most cases. 
However, this was not true for predicting the duplicate rate. 
Assembling difficult genomes such as those of plants can lead to 
an underestimate of the true duplicate rate. In this case, some 
parameter optimization (e.g., k-mer size) can help in generat- 
ing more robust QC metrics. While this approach is likely to be 
impractical in a production environment where different libraries 
may require different assembly parameters, other assemblers may 
perform better in this context and further work is needed to 
identify suitable alternatives to CLC. 

We recommend GC, coverage and BLAST-based similarity 
screening of preliminary assemblies for exclusion of contaminat- 
ing data before continuing with downstream analyses. This is 
easily achieved through the use of TAGC plots. For contamina- 
tion check, we used all contigs as input to the TAGC pipeline. 
Random selection of contigs can be useful to speed up the process 
of screening for contaminants but may significantly reduce the 
power to obtain quantitative estimates of contaminating reads. 
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